Concentration Bounds for Unigrams Language Model

Authors

  • Evgeny Drukh
  • Yishay Mansour
Abstract

We show several high-probability concentration bounds for learning a unigrams language model. One interesting quantity is the probability of all words appearing exactly $k$ times in a sample of size $m$. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately $O(k/\sqrt{m})$. We improve its dependency on $k$ to $O(\sqrt[4]{k}/\sqrt{m} + k/m)$. We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately $O(1/k + \sqrt{k}/m)$. We derive a combined estimator, which has an error of approximately $O(m^{-2/5})$, for any $k$. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigrams model. We show that its error has a high-probability bound of approximately $O(1/\sqrt{m})$, for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
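The two estimators named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names and the toy sample are hypothetical, and it assumes the classical Good-Turing form $(k+1)\,n_{k+1}/m$ and the empirical frequencies form $k\,n_k/m$, where $n_k$ is the number of distinct words seen exactly $k$ times. Some analyses normalize the Good-Turing estimate by $m-k$ rather than $m$; that variant is not shown here.

```python
from collections import Counter


def count_of_counts(sample):
    """Return a Counter mapping k -> n_k, the number of distinct
    words that appear exactly k times in the sample."""
    word_counts = Counter(sample)
    return Counter(word_counts.values())


def good_turing(sample, k):
    """Classical Good-Turing estimate of the total probability of all
    words appearing exactly k times: (k + 1) * n_{k+1} / m."""
    m = len(sample)
    n = count_of_counts(sample)
    return (k + 1) * n[k + 1] / m


def empirical_frequency(sample, k):
    """Empirical frequencies estimate of the same quantity: k * n_k / m."""
    m = len(sample)
    n = count_of_counts(sample)
    return k * n[k] / m


# Toy sample (hypothetical): m = 10 tokens, so n_1 = 3, n_2 = 2, n_3 = 1.
sample = "a a a b b c c d e f".split()

for k in range(3):
    print(k, good_turing(sample, k), empirical_frequency(sample, k))
```

For $k = 0$ the empirical frequencies estimate is always zero, while the Good-Turing estimate assigns the mass $n_1/m$ to unseen words. This is consistent with the abstract's $1/k$ term, which makes the empirical estimator's bound weak for small $k$.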


Similar resources

Conversation-based Language Modeling Using a Loss-sensitive Perceptron Algorithm

Discriminative language models using n-gram features have been shown to be effective in reducing speech recognition word error rates (WER). In this paper we describe a method for incorporating discourse-level triggers and topic designations into a discriminative language model. Triggers are features identifying re-occurrence of words within a conversation. Topics represent clusters of related c...


Evaluating Topic Modeling as Preprocessing for a Sentiment Analysis Task

Classifying the sentiment of documents is a well-studied problem in Natural Language Processing (NLP). The existence of excellent discriminative classifiers like Maxent has pushed the main body of research in the direction of feature engineering. In this paper, I examine an unusual class of features, the document-topic proportions assigned by the Latent Dirichlet Allocation topic model. In part...


Ideological Phrase Indicators for Classification of Political Discourse Framing on Twitter

Politicians carefully word their statements in order to influence how others view an issue, a political strategy called framing. Simultaneously, these frames may also reveal the beliefs or positions on an issue of the politician. Simple language features such as unigrams, bigrams, and trigrams are important indicators for identifying the general frame of a text, for both longer congressional sp...


Author identification in short texts

Most research on author identification considers large texts. Not much research has been done on author identification for short texts, even though short texts have become common with the rise of digital media. The anonymous nature of internet applications offers possibilities to use the internet for illegitimate purposes. In these cases, it can be very useful to be able to predict who the author of a mess...


Language identification with limited resources

Language identification is an important issue in many speech applications. We address this problem from the point of view of classification of sequences of phonemes, given the assumption that each language has its own phonotactic characteristics. In order to achieve this classification, we have to decode the speech utterances in terms of phonemes. The set of phonemes must be the same for all th...



Publication date: 2004